DAOS-18949 container: Fix sched_seq assert failures #18269

Merged: gnailzenh merged 1 commit into master from liw/cont-start-rc on May 26, 2026
@liw
Contributor

@liw liw commented May 18, 2026

The assertion "sched_seq1 != sched_seq2" in cont_child_create_start was likely triggered by the following scenario:

cont_child_create_start
  cont_child_start (the one near the beginning)
    if cont_child->sc_destroy
      returned -DER_CONT_NONEXIST
  vos_cont_create: -DER_EXIST
  assertion fails because no scheduling occurred

This patch changes cont_child_start and ds_cont_child_lookup to return -DER_CONT_DESTROYING instead of -DER_CONT_NONEXIST, so that the scenario above won't reach the vos_cont_create call.
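
The scenario above can be sketched as a small standalone C model. This is an illustration only, not the DAOS source: the MODEL_DER_* values, the model_* function names, the vos_exists flag, the global sched_seq counter, and the `fixed` switch are all assumptions made for the sketch, as is the premise that the assertion is checked when the create step reports an existing container.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder error values; the real DAOS codes are defined elsewhere. */
enum {
	MODEL_DER_EXIST           = 1,
	MODEL_DER_CONT_NONEXIST   = 2,
	MODEL_DER_CONT_DESTROYING = 3,
};

/* Hypothetical stand-in for the per-target container child. */
struct model_child {
	bool sc_destroy;  /* destroy is in progress */
	bool vos_exists;  /* the VOS container is still present */
};

struct model_result {
	int  rc;                 /* value handed back to the caller */
	bool reached_vos_create; /* was the create step called? */
	bool assert_would_fire;  /* -EXIST seen with no scheduling in between */
};

/* Bumped whenever the model "yields"; assumed to mirror sched_seq. */
static unsigned long sched_seq;

static int model_cont_child_start(const struct model_child *c, bool fixed)
{
	if (c->sc_destroy)
		return fixed ? -MODEL_DER_CONT_DESTROYING
			     : -MODEL_DER_CONT_NONEXIST;
	if (!c->vos_exists)
		return -MODEL_DER_CONT_NONEXIST;
	return 0;
}

static int model_vos_cont_create(struct model_child *c)
{
	if (c->vos_exists)
		return -MODEL_DER_EXIST; /* returns at once, no yield */
	sched_seq++;                     /* creation is assumed to yield */
	c->vos_exists = true;
	return 0;
}

static struct model_result model_create_start(struct model_child *c, bool fixed)
{
	struct model_result res = {0};
	unsigned long       seq1, seq2;
	int                 rc;

	assert(c != NULL);

	rc = model_cont_child_start(c, fixed);
	if (rc != -MODEL_DER_CONT_NONEXIST) {
		res.rc = rc; /* started, or an error other than "missing" */
		return res;
	}

	/* Only a "missing" container falls through to the create step. */
	seq1 = sched_seq;
	res.reached_vos_create = true;
	rc   = model_vos_cont_create(c);
	seq2 = sched_seq;

	/* Models the "sched_seq1 != sched_seq2" check on -EXIST. */
	if (rc == -MODEL_DER_EXIST)
		res.assert_would_fire = (seq1 == seq2);
	res.rc = rc;
	return res;
}
```

Calling model_create_start(&child, false) on a child with sc_destroy and vos_exists both set reaches the create step and reports assert_would_fire. With fixed set to true, the same child returns -MODEL_DER_CONT_DESTROYING before the create step is reached.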

Features: container

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'erasurecode/multiple_rank_failure.py:EcodOnlineMultiRankFail.test_ec_multiple_rank_failure - timeout destroying container in tearDown'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18949

@liw liw force-pushed the liw/cont-start-rc branch 2 times, most recently from 6a51f89 to a0f84ea Compare May 18, 2026 23:41
@liw
Contributor Author

liw commented May 18, 2026

First NLT broke, then NLT memcheck; adding Allow-unstable-test: true.

@daosbuild3
Collaborator

Test stage Unit Test bdev completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18269/3/display/redirect

@daosbuild3
Collaborator

The assertion "sched_seq1 != sched_seq2" in cont_child_create_start was
likely triggered by the following scenario:

  cont_child_create_start
    cont_child_start (the one near the beginning)
      if cont_child->sc_destroy
        returned -DER_CONT_NONEXIST
    vos_cont_create: -DER_EXIST
    assertion fails because no scheduling occurred

This patch changes cont_child_start and ds_cont_child_lookup to return
-DER_CONT_DESTROYING instead of -DER_CONT_NONEXIST, so that the scenario
above won't reach the vos_cont_create call.

Features: container
Allow-unstable-test: true
Signed-off-by: Li Wei <liwei@hpe.com>
@liw liw force-pushed the liw/cont-start-rc branch from a0f84ea to 0db65d2 Compare May 20, 2026 00:48
@liw
Contributor Author

liw commented May 20, 2026

Now Fault Injection and Test RPM failures, sigh; rebasing.

@liw liw marked this pull request as ready for review May 20, 2026 01:36
@liw liw requested review from a team as code owners May 20, 2026 01:36
@liw liw requested review from liuxuezhao and wangshilong May 20, 2026 01:36
@liw
Contributor Author

liw commented May 20, 2026

Requesting reviews early, since after 4 builds still 0 CI coverage.

Contributor

@wangshilong wangshilong left a comment

Thanks for fixing this.

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18269/5/execution/node/1384/log

@liw
Contributor Author

liw commented May 22, 2026

Build 5

  • [Features: container] container/boundary: DAOS-18610 (no sched_seq assertion failures)

@liw
Contributor Author

liw commented May 23, 2026

erasurecode/multiple_rank_failure: all 3 repeats passed:

https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18324/2/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/erasurecode/multiple_rank_failure.py/ (triggered with another PR equivalent to build 5 of this PR)

@liw liw requested a review from a team May 23, 2026 00:33
@liw liw added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label May 23, 2026
@liw
Contributor Author

liw commented May 23, 2026

Please see my last two comments on the latest round of testing. Thanks.

@gnailzenh gnailzenh merged commit 630306c into master May 26, 2026
39 of 41 checks passed
@gnailzenh gnailzenh deleted the liw/cont-start-rc branch May 26, 2026 02:50
Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.

5 participants